Dataset: Medical Information Mart for Intensive Care-III (MIMIC-III)

On the development and validation of large language model-based classifiers for identifying social determinants of health

NLP Tasks: Text Classification, Information Extraction

Method: LLM-based classifiers using Bidirectional Encoder Representations from Transformers (BERT) and A Robustly Optimized BERT Pretraining Approach (RoBERTa)

Metrics:

  • Area under the receiver operating characteristic curve (AUROC) for homelessness (0.78)
  • AUROC for food insecurity (0.72)
  • AUROC for domestic violence (0.83)
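
A minimal sketch of how such a BERT/RoBERTa classifier could be fine-tuned on note snippets and scored with AUROC, assuming Hugging Face Transformers and scikit-learn; the two-note toy dataset, the roberta-base checkpoint, and the hyperparameters are illustrative, not the paper's setup:

```python
# Fine-tune a RoBERTa classifier for one binary SDOH label (homelessness)
# and score it with AUROC. Toy data and hyperparameters are assumptions.
import torch
from sklearn.metrics import roc_auc_score
from transformers import AutoModelForSequenceClassification, AutoTokenizer

texts = [
    "Patient is currently undomiciled and staying in a shelter.",
    "Lives at home with spouse; no housing concerns reported.",
]
labels = [1, 0]  # 1 = homelessness documented, 0 = not documented

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
target = torch.tensor(labels)

model.train()
for _ in range(3):  # a few illustrative epochs on the toy batch
    loss = model(**batch, labels=target).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.eval()
with torch.no_grad():
    probs = torch.softmax(model(**batch).logits, dim=-1)[:, 1].numpy()
print("AUROC:", roc_auc_score(labels, probs))
```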

Extraction of Substance Use Information From Clinical Notes: Generative Pretrained Transformer-Based Investigation

NLP Tasks: Text Classification, Information Extraction, Question Answering, Text Generation

Method: A generative pretrained transformer (GPT) model, specifically GPT-3.5

Metrics:

  • Accuracy (high in the zero-shot setting)
  • Recall (improved with few-shot learning)
  • F1-score (improved with few-shot learning)
  • Precision (lower with few-shot learning)
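
A hedged sketch of the zero- versus few-shot distinction using the OpenAI chat API; the model name, prompt wording, and example note are assumptions, not the study's actual prompts:

```python
# Zero- vs few-shot prompting for substance-use extraction from a note.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

FEW_SHOT_EXAMPLES = [
    {"role": "user", "content": "Note: Drinks 2 beers nightly. Denies tobacco."},
    {"role": "assistant",
     "content": '{"alcohol": "current", "tobacco": "never", "drugs": "unknown"}'},
]

def extract_substance_use(note: str, few_shot: bool = False) -> str:
    """Ask the model for a small JSON summary of substance-use status."""
    messages = [{"role": "system",
                 "content": "Extract alcohol, tobacco, and drug use status "
                            "from the clinical note as JSON."}]
    if few_shot:
        messages += FEW_SHOT_EXAMPLES  # worked examples enable few-shot learning
    messages.append({"role": "user", "content": f"Note: {note}"})
    resp = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
    return resp.choices[0].message.content

print(extract_substance_use("Former smoker, quit 2010. Occasional marijuana use.",
                            few_shot=True))
```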

The potential and pitfalls of using a large language model such as ChatGPT, GPT-4, or LLaMA as a clinical assistant

NLP Tasks: Text Classification, Information Extraction, Question Answering

Method: Evaluation of ChatGPT, GPT-4, and LLaMA in identifying patients with specific diseases using gold-labeled Electronic Health Records (EHRs) from the MIMIC-III database.

Metrics:

  • F1-score (≥ 85% for chronic obstructive pulmonary disease [COPD], chronic kidney disease [CKD], and primary biliary cholangitis [PBC])
  • F1-score (4.23% higher for PBC than traditional machine learning models)
  • Precision
  • Specificity
  • Sensitivity
  • Negative Predictive Value
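
A sketch of the evaluation side only: deriving the listed metrics from a confusion matrix of gold EHR labels versus parsed model answers for one disease; the toy labels are illustrative:

```python
# Compute F1, precision, sensitivity, specificity, and NPV for binary
# disease identification. Toy labels stand in for gold EHR annotations.
from sklearn.metrics import confusion_matrix, f1_score, precision_score

y_true = [1, 1, 1, 0, 0, 0, 0, 1]  # gold labels from annotated EHRs
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]  # labels parsed from the LLM's answers

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("F1:", f1_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Sensitivity:", tp / (tp + fn))
print("Specificity:", tn / (tn + fp))
print("Negative predictive value:", tn / (tn + fn))
```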

ARDSFlag: an NLP/machine learning algorithm to visualize and detect high-probability ARDS admissions independent of provider recognition and billing codes

NLP Tasks: Text Classification, Information Extraction

Method: ARDSFlag algorithm using machine learning (ML) and natural language processing (NLP) techniques

Metrics:

  • Accuracy (91.9%±0.5% for bilateral infiltrates, 86.1%±0.5% for heart failure/fluid overload in radiology reports, 98.4%±0.3% for echocardiogram notes)
  • Overall accuracy (89.0%)
  • Specificity (91.7%)
  • Recall (80.3%)
  • Precision (75.0%)
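
A much-simplified stand-in for one ARDSFlag component, classifying radiology reports for bilateral infiltrates; TF-IDF with logistic regression is an assumption for illustration, and the published pipeline combines several NLP/ML steps:

```python
# Classify whether a radiology report describes bilateral infiltrates.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reports = [
    "Diffuse bilateral airspace opacities consistent with infiltrates.",
    "Clear lungs bilaterally. No focal consolidation.",
    "Patchy bilateral infiltrates, worse at the bases.",
    "Right lower lobe pneumonia; left lung clear.",
]
has_bilateral_infiltrates = [1, 0, 1, 0]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(reports, has_bilateral_infiltrates)
print(clf.predict(["New bilateral infiltrates on chest x-ray."]))
```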

Redefining Health Care Data Interoperability: Empirical Exploration of Large Language Models in Information Exchange

NLP Tasks: Information Extraction, Text Generation

Method: A text-based information exchange approach facilitated by the LLM ChatGPT

Metrics:

  • Accuracy (over 99%)
  • Accuracy (NAME: 10.2%, NAME+SYN: 36.1% with typos, NAME+SYN: 61.8% with typo-specific fine-tuning)
  • Accuracy (NAME: 11.2%, NAME+SYN: 92.7% for unseen synonyms)
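
The NAME/NAME+SYN accuracies suggest exact-match scoring of mapping free-text mentions (typos, synonyms) to canonical names; a hedged sketch of that scoring, with a stub standing in for the actual ChatGPT call and illustrative drug names:

```python
# Exact-match accuracy for normalizing noisy drug mentions.
test_cases = [
    ("acetaminophin", "acetaminophen"),  # typo
    ("paracetamol", "acetaminophen"),    # synonym
    ("asprin", "aspirin"),               # typo
]

def normalize_with_llm(mention: str) -> str:
    # Stub for an LLM call that returns the canonical name; the real study
    # would prompt ChatGPT here.
    lookup = {"acetaminophin": "acetaminophen",
              "paracetamol": "acetaminophen",
              "asprin": "aspirin"}
    return lookup.get(mention, mention)

correct = sum(normalize_with_llm(m) == gold for m, gold in test_cases)
print(f"Accuracy: {correct / len(test_cases):.1%}")
```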

Constructing synthetic datasets with generative artificial intelligence to train large language models to classify acute renal failure from clinical notes

NLP Tasks: Text Classification, Information Extraction, Question Answering

Method: Language model classifiers for identifying acute renal failure, trained on synthetic clinical notes generated with generative AI.

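A sketch of the synthetic-data idea: prompt a generative model for labeled synthetic notes, then reuse them as training pairs; the prompt wording and gpt-3.5-turbo model name are assumptions:

```python
# Generate labeled synthetic ICU notes for acute renal failure (ARF).
from openai import OpenAI

client = OpenAI()

def synth_note(has_arf: bool) -> str:
    condition = "acute renal failure" if has_arf else "no kidney problems"
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": "Write a short synthetic ICU progress note "
                              f"for a patient with {condition}."}],
    )
    return resp.choices[0].message.content

# Small synthetic training set with known labels; these (note, label) pairs
# could feed a fine-tuning loop like the RoBERTa sketch in the first entry.
dataset = [(synth_note(True), 1) for _ in range(5)] + \
          [(synth_note(False), 0) for _ in range(5)]
print(len(dataset), "synthetic examples")
```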

Exposing Vulnerabilities in Clinical LLMs Through Data Poisoning Attacks: Case Study in Breast Cancer

NLP Tasks: Text Classification, Information Extraction, Question Answering

Method: Data poisoning attacks against clinical LLMs, with breast cancer as the case study

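One common poisoning mechanic is flipping the labels of a small fraction of training examples before fine-tuning; a generic sketch (the 10% rate and toy corpus are assumptions, and the paper's actual attack is not reproduced here):

```python
# Label-flipping poisoning on a (text, label) training set.
import random

def poison_labels(dataset, fraction=0.10, seed=0):
    """Return a copy of the dataset with `fraction` of labels flipped."""
    rng = random.Random(seed)
    poisoned = list(dataset)
    for i in rng.sample(range(len(poisoned)), k=int(fraction * len(poisoned))):
        text, label = poisoned[i]
        poisoned[i] = (text, 1 - label)  # binary label flip
    return poisoned

clean = [(f"note {i}", i % 2) for i in range(100)]  # toy stand-in corpus
poisoned = poison_labels(clean)
print(sum(c[1] != p[1] for c, p in zip(clean, poisoned)), "labels flipped")
```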

A Large Language Model Screening Tool to Target Patients for Best Practice Alerts: Development and Validation

NLP Tasks: Text Classification

Method: AI screening tool using the BioMed-RoBERTa model

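A sketch of the screening step at inference time: run a fine-tuned classifier over incoming notes and raise a best practice alert above a probability threshold; allenai/biomed_roberta_base is the public BioMed-RoBERTa checkpoint, while the threshold and alert logic are assumptions:

```python
# Screen notes with a classifier and trigger a best practice alert.
from transformers import pipeline

# In practice this checkpoint would first be fine-tuned for the target label.
screen = pipeline("text-classification", model="allenai/biomed_roberta_base")

def maybe_alert(note: str, threshold: float = 0.9) -> bool:
    result = screen(note, truncation=True)[0]
    return result["label"] == "LABEL_1" and result["score"] >= threshold

if maybe_alert("Patient reports new-onset chest pain at rest."):
    print("Best practice alert triggered")
```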

Evaluation and mitigation of the limitations of large language models in clinical decision-making

NLP Tasks: Information Extraction, Text Classification, Question Answering

Method: A framework simulating a realistic clinical setting, using a curated dataset based on the Medical Information Mart for Intensive Care database

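A hedged sketch of such a simulation: the model requests information step by step (history, labs, imaging) before committing to a diagnosis; the request/answer protocol and the ask_llm stub are assumptions, not the paper's framework:

```python
# Simulated clinical-decision loop with stepwise information requests.
AVAILABLE = {
    "history": "Right lower quadrant pain for 12 hours.",
    "labs": "WBC 14.2, CRP elevated.",
    "imaging": "Ultrasound: non-compressible appendix.",
}

def ask_llm(transcript: str) -> str:
    # Stub for an LLM call; a real run would send `transcript` to a model
    # that replies either "REQUEST <item>" or "DIAGNOSIS <text>".
    seen = transcript.count("REQUEST")
    return ["REQUEST labs", "REQUEST imaging", "DIAGNOSIS appendicitis"][min(seen, 2)]

transcript = "history: " + AVAILABLE["history"]
while True:
    reply = ask_llm(transcript)
    transcript += "\n" + reply
    if reply.startswith("DIAGNOSIS"):
        print(reply)
        break
    item = reply.split()[1]
    transcript += f"\n{item}: {AVAILABLE.get(item, 'unavailable')}"
```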

Learning to Make Rare and Complex Diagnoses With Generative AI Assistance: Qualitative Study of Popular Large Language Models

NLP Tasks: Information Extraction, Text Classification, Question Answering, Text Generation

Method: Evaluation of three popular large language models (Bard, ChatGPT-3.5, and GPT-4) using various prompting strategies and a majority-voting strategy.

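A minimal sketch of the majority-voting strategy across model outputs; the hard-coded answers stand in for real API responses:

```python
# Majority vote over diagnoses proposed by several LLMs.
from collections import Counter

answers = {
    "Bard": "Fabry disease",
    "ChatGPT-3.5": "Fabry disease",
    "GPT-4": "Gaucher disease",
}

diagnosis, votes = Counter(answers.values()).most_common(1)[0]
print(f"Majority diagnosis: {diagnosis} ({votes}/{len(answers)} votes)")
```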

The shaky foundations of large language models and foundation models for electronic health records

NLP Tasks: Information Extraction, Text Classification, Question Answering

Method: A narrative review and a taxonomy of foundation models trained on non-imaging EMR data
